Classifying Network Data with Deep Kernel Machines 

Xiao Tang and Mu Zhu 
Department of Statistics and Actuarial Science 
University of Waterloo 
Waterloo, Ontario, Canada N2L 3G1 

January 22, 2010 
Abstract 

Inspired by a growing interest in analyzing network data, we study the problem of node classification
on graphs, focusing on approaches based on kernel machines. Conventionally, kernel machines
are linear classifiers in the implicit feature space. We argue that linear classification in the feature 
space of kernels commonly used for graphs is often not enough to produce good results. When this 
is the case, one naturally considers nonlinear classifiers in the feature space. We show that repeating 
this process produces something we call "deep kernel machines." We provide some examples where 
deep kernel machines can make a big difference in classification performance, and point out some 
connections to various recent literature on deep architectures in artificial intelligence and machine 
learning. 
1 Introduction



Given a graph, $\mathcal{G}$, let $V(\mathcal{G}) = \{v_1, v_2, \ldots, v_n\}$ denote its set of vertices (nodes). Associated
with each node $v_i$ is a class label, say $y_i \in \{1, 2, \ldots, C\}$. A small number of these labels are
known; others are unknown. We use the notation

$$V_{obs}(\mathcal{G}) = \{v_i \in V(\mathcal{G}) : y_i \text{ observed}\} \quad \text{and} \quad V_{miss}(\mathcal{G}) = \{v_i \in V(\mathcal{G}) : y_i \text{ missing}\}$$

to denote subsets of nodes whose labels are known and unknown, respectively. The problem
that we study in this article is that of predicting the unknown labels based on how the nodes
are connected to each other.

For example, $\mathcal{G}$ may be a protein interaction network. Some proteins are known to be
associated with certain biological functions (e.g., cell communication), and we may be
interested in predicting which of the remaining proteins are also associated. Another example
of $\mathcal{G}$ is a social network, where the nodes are individuals. Some individuals are known to
be involved with certain activities (e.g., belonging to a certain club, or interested in certain
products), and we may wish to assess the likelihood that the remaining individuals are also
involved. For simplicity, we will assume $C = 2$, i.e., there are only two classes, but our ideas
also apply when $C > 2$.

We develop something we call "deep kernel machines" (DKMs) on graphs. Though we 
focus on the problem of node classification on graphs, it will become clear by the end of this 
article (Section 5) that the idea of DKMs does not really depend on the graph structure of 
the data. The main reasons why we choose to start with the node classification problem 
on graphs are: (a) it is a relatively new problem and not as widely studied as some of 
the more conventional classification problems (Kolaczyk 2009); and (b) it is a problem for 
which kernel machines are particularly well suited (Shawe-Taylor and Cristianini 2004).
Experiments with data structures other than graphs will be left as future work. 



2 Kernel machines and graphs 

Lately, largely due to support vector machines (see, e.g., Vapnik 1995; Cristianini and
Shawe-Taylor 2000), kernel machines have become very popular in machine learning
(Shawe-Taylor and Cristianini 2004; Bishop 2006). Typically, a kernel machine has the form

$$f(v) = \alpha_0 + \sum_{v_i \in V_{obs}(\mathcal{G})} \alpha_i K(v, v_i), \tag{1}$$

where $K(\cdot, \cdot)$ is a kernel function, and the coefficient $\alpha_i$ often depends on the class label $y_i$.

A support vector machine (SVM) is sometimes called a sparse kernel machine because
the optimization problem it solves will cause many of the $\alpha_i$'s to be 0. As a result, the
decision function (1) depends only on those observations $(v_i, y_i)$ with $\alpha_i > 0$, and hence
the name "sparse." In order to crystalize the gist of our ideas, we shall mostly work with
a much simpler kernel machine (Section 2.1), but other kernel machines such as SVMs also
can be used.

As noted by Shawe-Taylor and Cristianini (2004), kernel machines are particularly well
suited to analyze non-vectorial data such as graphs. Before we talk about kernels on graphs,
a few graph-theoretic concepts are needed. The adjacency matrix, $A_{\mathcal{G}}$, for graph $\mathcal{G}$ is defined
as

$$A_{\mathcal{G}}(i,j) = \begin{cases} 1, & \text{if there is an edge between } v_i \text{ and } v_j; \\ 0, & \text{if not.} \end{cases}$$
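Let

$$d_i = \sum_{j} A_{\mathcal{G}}(i,j)$$

be the degree of node $v_i$, or simply the number of nodes it is adjacent to. The graph
Laplacian matrix $L_{\mathcal{G}}$ for $\mathcal{G}$ is defined by

$$L_{\mathcal{G}} = D_{\mathcal{G}} - A_{\mathcal{G}}, \quad \text{where } D_{\mathcal{G}} = \mathrm{diag}\{d_1, d_2, \ldots, d_n\}.$$

In other words,

$$L_{\mathcal{G}}(i,j) = \begin{cases} d_i, & \text{if } i = j; \\ -A_{\mathcal{G}}(i,j), & \text{if } i \neq j. \end{cases}$$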

With these basic notions, we are now ready to talk about kernels on graphs. For graph
data, one often uses a so-called diffusion kernel (Lafferty and Lebanon 2005), defined as

$$K_{\mathcal{G}} = \exp(-\beta L_{\mathcal{G}}) = \sum_{m=0}^{\infty} \frac{(-\beta)^m}{m!} L_{\mathcal{G}}^m, \tag{2}$$

where $\beta > 0$ is a tuning parameter. The meaning and intuition behind the diffusion kernel
require a relatively long explanation, which we won't go into. Roughly speaking, to measure
the similarity between $v_i$ and $v_j$, the diffusion kernel takes into account the number of paths
of length $m$ between $v_i$ and $v_j$, for all $m$, and gives shorter paths more weight (e.g., Kolaczyk
2009, Section 8.4.2).

More generally, one is not required to use the adjacency matrix or the graph Laplacian
to construct the diffusion kernel. Instead, a diffusion kernel (2) can be constructed as long
as $A_{\mathcal{G}}$ is a valid similarity matrix (see, e.g., Shawe-Taylor and Cristianini 2004); we will
show an example below (Section 4.2).

To compute the diffusion kernel, let $L_{\mathcal{G}} = U \Sigma U^T$ be the spectral decomposition of the
Laplacian matrix, where $\Sigma = \mathrm{diag}(s_1, \ldots, s_n)$. Using the fact that $L_{\mathcal{G}}^m$ has the same eigenvectors
for all $m$, the diffusion kernel can be computed by

$$K_{\mathcal{G}} = U \, \mathrm{diag}\left(e^{-\beta s_1}, \ldots, e^{-\beta s_n}\right) U^T.$$

We will use $K_{\mathcal{G}}(i,j)$ and $K_{\mathcal{G}}(v_i, v_j)$ interchangeably.
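As an illustration of the spectral computation just described, the following is a minimal NumPy sketch. The function name `diffusion_kernel` and the three-node path graph are our own illustrative choices, not part of the original experiments.

```python
import numpy as np

def diffusion_kernel(A, beta):
    """Return exp(-beta * L) for the graph Laplacian L = D - A."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A          # graph Laplacian
    s, U = np.linalg.eigh(L)                # L = U diag(s) U^T (L is symmetric)
    return (U * np.exp(-beta * s)) @ U.T    # U diag(exp(-beta * s)) U^T

# Toy path graph on three nodes: 0 - 1 - 2 (illustrative only).
A_toy = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [0, 1, 0]])
K_toy = diffusion_kernel(A_toy, beta=0.5)
```

In this toy graph, node 0 comes out more similar to its neighbor, node 1, than to node 2, which is two steps away. This agrees with the intuition that shorter paths receive more weight.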

2.1 A simple kernel machine 

Since $K_{\mathcal{G}}(v_i, v_j)$ measures the similarity between nodes $v_i$ and $v_j$, to classify $v_0 \in V_{miss}(\mathcal{G})$
we can clearly use the function

$$f_{\mathcal{G}}(v_0) = \frac{1}{n_1} \sum_{\substack{v_i \in V_{obs}(\mathcal{G}) \\ y_i = 1}} K_{\mathcal{G}}(v_0, v_i) - \frac{1}{n_2} \sum_{\substack{v_i \in V_{obs}(\mathcal{G}) \\ y_i = 2}} K_{\mathcal{G}}(v_0, v_i), \tag{3}$$

where $n_1$ and $n_2$ are the total numbers of nodes in $V_{obs}(\mathcal{G})$ with class labels 1 and 2, respectively.
For $v_0$, this is simply the difference between its average similarity to class 1 and its average
similarity to class 2. For example, we can classify $v_0$ to the class to which it is more similar,
i.e.,

$$\hat{y}_0 = \begin{cases} 1, & \text{if } f_{\mathcal{G}}(v_0) > c, \\ 2, & \text{otherwise,} \end{cases} \tag{4}$$

for some thresholding constant $c$. Below, we will sometimes drop the subscript
"$v_i \in V_{obs}(\mathcal{G})$" from the summation in (3).

It is easy to see that the function $f_{\mathcal{G}}(v_0)$ in (3) is of the general form (1), with $\alpha_0 = 0$,
$\alpha_i = 1/n_1$ or $-1/n_2$ depending on whether $y_i = 1$ or 2, and $K = K_{\mathcal{G}}$. In other words,
(3) is a simple kernel machine. We shall mostly work with this simple kernel machine
because it can easily be constructed without invoking expensive optimization procedures in
order to determine the coefficients, $\alpha_0, \alpha_1, \alpha_2, \ldots, \alpha_n$. A more sophisticated kernel machine
such as the SVM, on the other hand, would require quadratic programming to find these
coefficients. However, though we use (3) for convenience, we emphasize that the framework we
develop below does not preclude us from using other kernel machines such as SVMs.
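The simple kernel machine (3) and the decision rule (4) can be sketched in a few lines of NumPy. The function names and the 4 x 4 similarity matrix below are hypothetical and serve only to make the arithmetic concrete.

```python
import numpy as np

def simple_kernel_machine(K, obs_idx, y_obs, miss_idx):
    """Score f(v0): mean similarity to class 1 minus mean similarity to class 2."""
    y_obs = np.asarray(y_obs)
    K_mo = K[np.ix_(miss_idx, obs_idx)]     # rows: unlabelled nodes, cols: labelled nodes
    return K_mo[:, y_obs == 1].mean(axis=1) - K_mo[:, y_obs == 2].mean(axis=1)

def classify(scores, c=0.0):
    """Decision rule (4): class 1 if the score exceeds c, class 2 otherwise."""
    return np.where(scores > c, 1, 2)

# Hypothetical similarity matrix: nodes 0 and 1 resemble each other, as do 2 and 3.
K_toy = np.array([[1.0, 0.8, 0.1, 0.2],
                  [0.8, 1.0, 0.2, 0.1],
                  [0.1, 0.2, 1.0, 0.7],
                  [0.2, 0.1, 0.7, 1.0]])
scores = simple_kernel_machine(K_toy, obs_idx=[0, 2], y_obs=[1, 2], miss_idx=[1, 3])
labels = classify(scores)
```

Here node 0 is labelled class 1 and node 2 is labelled class 2. Node 1 scores 0.8 - 0.2 = 0.6 and is assigned to class 1, while node 3 scores 0.2 - 0.7 = -0.5 and is assigned to class 2.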

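3 Deep kernel machines on $\mathcal{G}$

A key idea behind kernel machines is that kernels can be regarded as calculating inner
products in an implicit feature space, call it $\mathcal{F}$. That is,

$$K_{\mathcal{G}}(v_i, v_j) = \langle \phi(v_i), \phi(v_j) \rangle, \quad \text{where } \phi : V(\mathcal{G}) \to \mathcal{F}.$$

This means the kernel $K_{\mathcal{G}}$ necessarily induces a distance function in $\mathcal{F}$,

$$\begin{aligned}
d_{\mathcal{F}}(v_i, v_j) &= \|\phi(v_i) - \phi(v_j)\| \\
&= \sqrt{\langle \phi(v_i), \phi(v_i) \rangle - 2 \langle \phi(v_i), \phi(v_j) \rangle + \langle \phi(v_j), \phi(v_j) \rangle} \\
&= \sqrt{K_{\mathcal{G}}(v_i, v_i) - 2 K_{\mathcal{G}}(v_i, v_j) + K_{\mathcal{G}}(v_j, v_j)}. \tag{5}
\end{aligned}$$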
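The distance in (5) needs only the entries of the kernel matrix. A minimal NumPy sketch follows. The function name `kernel_distance` is ours, and the check uses an explicit feature map (a linear kernel) so that the result can be compared with ordinary Euclidean distances.

```python
import numpy as np

def kernel_distance(K):
    """Pairwise feature-space distances implied by a kernel matrix, as in (5)."""
    diag = np.diag(K)
    sq = diag[:, None] - 2.0 * K + diag[None, :]
    return np.sqrt(np.maximum(sq, 0.0))     # clip tiny negatives caused by round-off

# With K = X X^T (linear kernel), the implied distance is the Euclidean distance.
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [1.0, 0.0]])
D = kernel_distance(X @ X.T)
```

For these three points, the distance between the first two is 5 and the distance between the first and the third is 1, exactly as in the plane.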

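3.1 Linear classification in $\mathcal{F}$

Using the distance $d_{\mathcal{F}}$, or more specifically the squared distance $d_{\mathcal{F}}^2$, the decision rule (4) is
equivalent to nearest-centroid classification in the feature space $\mathcal{F}$. To see this, notice that

$$\frac{1}{n_1} \sum_{y_i = 1} \phi(v_i) \quad \text{and} \quad \frac{1}{n_2} \sum_{y_i = 2} \phi(v_i)$$

are class centroids in $\mathcal{F}$. Nearest-centroid classification simply declares $\hat{y}_0 = 1$ if $v_0$ is closer
to the centroid of class 1, i.e., if

$$\left\| \phi(v_0) - \frac{1}{n_1} \sum_{y_i = 1} \phi(v_i) \right\|^2 < \left\| \phi(v_0) - \frac{1}{n_2} \sum_{y_i = 2} \phi(v_i) \right\|^2.$$

This is equivalent to

$$\left[ \langle \phi(v_0), \phi(v_0) \rangle - \frac{2}{n_1} \sum_{y_i = 1} \langle \phi(v_i), \phi(v_0) \rangle + \frac{1}{n_1^2} \sum_{y_i, y_j = 1} \langle \phi(v_i), \phi(v_j) \rangle \right]
- \left[ \langle \phi(v_0), \phi(v_0) \rangle - \frac{2}{n_2} \sum_{y_i = 2} \langle \phi(v_i), \phi(v_0) \rangle + \frac{1}{n_2^2} \sum_{y_i, y_j = 2} \langle \phi(v_i), \phi(v_j) \rangle \right] < 0.$$

Cancelling out $\langle \phi(v_0), \phi(v_0) \rangle$ and dividing by $-2$, we obtain

$$\frac{1}{n_1} \sum_{y_i = 1} K_{\mathcal{G}}(v_0, v_i) - \frac{1}{n_2} \sum_{y_i = 2} K_{\mathcal{G}}(v_0, v_i) - c(V_{obs}(\mathcal{G})) > 0,$$

where

$$c(V_{obs}(\mathcal{G})) = \frac{1}{2 n_1^2} \sum_{y_i, y_j = 1} K_{\mathcal{G}}(v_i, v_j) - \frac{1}{2 n_2^2} \sum_{y_i, y_j = 2} K_{\mathcal{G}}(v_i, v_j)$$

is a constant that depends only on $V_{obs}(\mathcal{G})$ and not on $v_0$. Clearly, this is equivalent to (4).

Being equivalent to nearest-centroid classification, the kernel machine (3) is a linear
classifier in the feature space $\mathcal{F}$. In fact, most kernel machines are linear in the feature
space, including SVMs.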
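The equivalence derived above can be checked numerically. The sketch below is only an illustration under assumptions of our own: it uses randomly generated explicit features with a linear kernel, so that both the nearest-centroid rule and the kernel rule $f(v_0) > c(V_{obs})$ can be computed directly and compared.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))            # explicit features phi(v_i) of labelled nodes
y = np.array([1] * 5 + [2] * 7)         # their class labels
x0 = rng.normal(size=(20, 3))           # features of unlabelled nodes

# Nearest-centroid rule in the feature space.
m1, m2 = X[y == 1].mean(axis=0), X[y == 2].mean(axis=0)
closer_to_1 = np.linalg.norm(x0 - m1, axis=1) < np.linalg.norm(x0 - m2, axis=1)

# Kernel rule f(v0) > c, computed from inner products only.
K_obs = X @ X.T
K_0 = x0 @ X.T
f = K_0[:, y == 1].mean(axis=1) - K_0[:, y == 2].mean(axis=1)
c = (0.5 * K_obs[np.ix_(y == 1, y == 1)].mean()
     - 0.5 * K_obs[np.ix_(y == 2, y == 2)].mean())
kernel_rule = f > c
```

The two boolean vectors `closer_to_1` and `kernel_rule` agree for every unlabelled point, as the derivation predicts.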

3.2 Nonlinear classification in $\mathcal{F}$

However, it is quite possible that a linear classifier in $\mathcal{F}$ is not sufficient; we will show some
examples below (Section 4). But, in principle, there is nothing that prevents us from using
other, more flexible classifiers in $\mathcal{F}$. For example, using the distance $d_{\mathcal{F}}$, we may consider
a classifier based on kernel density estimates. Let

$$\hat{p}_k(v_0) = \frac{1}{n_k} \sum_{\substack{v_i \in V_{obs}(\mathcal{G}) \\ y_i = k}} K_{h(\mathcal{F})}\left(d_{\mathcal{F}}(v_0, v_i)\right) \tag{6}$$

be a kernel density estimate of the distribution for class $k$. Many kernel functions can be
used for density estimation, e.g.,

$$K_{h(\mathcal{F})}(d_{\mathcal{F}}) = \frac{1}{\sqrt{2\pi}\, h(\mathcal{F})} \exp\left(-\frac{d_{\mathcal{F}}^2}{2 h^2(\mathcal{F})}\right), \tag{7}$$

where $h(\mathcal{F})$ is a bandwidth parameter, which serves to scale the distance $d_{\mathcal{F}}$. We shall write

$$K_{\mathcal{F}}(v_i, v_j) = K_{h(\mathcal{F})}\left(d_{\mathcal{F}}(v_i, v_j)\right). \tag{8}$$

Using (7) for $K_{h(\mathcal{F})}$, $K_{\mathcal{F}}$ is nothing but the well-known radial-basis or Gaussian kernel,
except it uses the distance function $d_{\mathcal{F}}$ rather than a distance defined on the original graph
$\mathcal{G}$. Therefore, (6) is a density estimate in the space $\mathcal{F}$ rather than on the original graph $\mathcal{G}$.
The subscript $\mathcal{F}$ and the notation $h(\mathcal{F})$ are used to emphasize this fact and to differentiate
$K_{\mathcal{F}}$ from $K_{\mathcal{G}}$, the kernel on the original graph $\mathcal{G}$ that induced the space $\mathcal{F}$.

Based on the kernel density estimates in (6), for each $v_0 \in V_{miss}(\mathcal{G})$ we can predict its
class label $y_0$ depending on whether $\hat{p}_1(v_0) - \hat{p}_2(v_0)$ is positive or negative. In other words,
the decision function is simply

$$f_{\mathcal{F}}(v_0) = \frac{1}{n_1} \sum_{y_i = 1} K_{\mathcal{F}}(v_0, v_i) - \frac{1}{n_2} \sum_{y_i = 2} K_{\mathcal{F}}(v_0, v_i). \tag{9}$$

It is easy to see that (9) is another kernel machine of the same form as (3); the only difference
is that (9) uses the kernel $K_{\mathcal{F}}$ whereas (3) uses the kernel $K_{\mathcal{G}}$.

3.3 Deep kernel machines 

Let us summarize what we have said so far. The space $\mathcal{F}$ is the implicit feature space
for $K_{\mathcal{G}}$. A kernel machine $f_{\mathcal{G}}$ (3) using the kernel $K_{\mathcal{G}}$ is a linear classifier in $\mathcal{F}$. If linear
classifiers are not sufficient in $\mathcal{F}$, we can relax linearity and choose to work with a nonlinear
classifier, e.g., by constructing kernel density estimates in $\mathcal{F}$ via the implied distance metric
$d_{\mathcal{F}}(v_i, v_j)$ given by equation (5). This gives rise to a new kernel machine $f_{\mathcal{F}}$ (9), using the kernel
$K_{\mathcal{F}}(v_i, v_j)$ given by equation (8). If we use (7) for kernel density estimation, the kernel updating
formula from $K_{\mathcal{G}}$ to $K_{\mathcal{F}}$ is simply, putting (5), (7), and (8) together,

$$K_{\mathcal{F}}(v_i, v_j) = \frac{1}{\sqrt{2\pi}\, h(\mathcal{F})} \exp\left(-\frac{K_{\mathcal{G}}(v_i, v_i) - 2 K_{\mathcal{G}}(v_i, v_j) + K_{\mathcal{G}}(v_j, v_j)}{2 h^2(\mathcal{F})}\right). \tag{10}$$

The choice of $h(\mathcal{F})$ will be discussed below (Section 3.4).
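A single application of the updating formula (10) can be sketched as follows. The function name `update_kernel` and the 2 x 2 base kernel are hypothetical, and the bandwidth is simply supplied by hand here.

```python
import numpy as np

def update_kernel(K, h):
    """One step of (10): Gaussian kernel on the distance implied by K."""
    diag = np.diag(K)
    sq_dist = np.maximum(diag[:, None] - 2.0 * K + diag[None, :], 0.0)
    return np.exp(-sq_dist / (2.0 * h ** 2)) / (np.sqrt(2.0 * np.pi) * h)

K0 = np.array([[1.0, 0.5],
               [0.5, 1.0]])             # hypothetical base kernel
K1 = update_kernel(K0, h=1.0)
```

In this example the squared distance between the two nodes is 1 - 2(0.5) + 1 = 1, so the off-diagonal entry of `K1` is $e^{-1/2}/\sqrt{2\pi}$ and each diagonal entry is $1/\sqrt{2\pi}$.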

However, there is no reason why the process must end here. The kernel $K_{\mathcal{F}}$ has its
implicit feature space as well; let's call it $\mathcal{F}^2$. The kernel machine $f_{\mathcal{F}}$ (9) using the kernel
$K_{\mathcal{F}}$ is a linear classifier in $\mathcal{F}^2$. We can relax linearity in $\mathcal{F}^2$, if necessary, and choose to
work with a nonlinear classifier, again, by constructing kernel density estimates in $\mathcal{F}^2$ via
the implied distance metric,

$$d_{\mathcal{F}^2}(v_i, v_j) = \sqrt{K_{\mathcal{F}}(v_i, v_i) - 2 K_{\mathcal{F}}(v_i, v_j) + K_{\mathcal{F}}(v_j, v_j)}.$$

By the same argument, this would give us yet another kernel machine, say $f_{\mathcal{F}^2}$, of exactly
the same form as $f_{\mathcal{G}}$ (3) and $f_{\mathcal{F}}$ (9), except it would be using the kernel

$$K_{\mathcal{F}^2}(v_i, v_j) = K_{h(\mathcal{F}^2)}\left(d_{\mathcal{F}^2}(v_i, v_j)\right)
= \frac{1}{\sqrt{2\pi}\, h(\mathcal{F}^2)} \exp\left(-\frac{K_{\mathcal{F}}(v_i, v_i) - 2 K_{\mathcal{F}}(v_i, v_j) + K_{\mathcal{F}}(v_j, v_j)}{2 h^2(\mathcal{F}^2)}\right).$$

It is easy to see that this process can be repeated recursively (see Table 1). We refer
to kernel machines generated by this recursive process as "deep kernel machines" (DKMs).
The one using the original kernel $K_{\mathcal{G}}$ is referred to as a level-0 DKM; the one using
the kernel $K_{\mathcal{F}}$, a level-1 DKM; the one using the kernel $K_{\mathcal{F}^2}$, a level-2 DKM; and so on.



Notice that the DKM algorithm presented in Table 1 is slightly more general than what 
we have discussed above. In our discussion, we have focused on a specific base kernel 
machine (Section 2.1), but one can certainly use other base kernel machines, e.g., SVMs 
(see Section 4.4 below). 
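The recursion can be sketched in NumPy as follows, using the simple base kernel machine of Section 2.1. This is an illustration under stated assumptions rather than the code used in our experiments: the function names and the six-node toy graph are hypothetical, and the bandwidth is taken to be the average distance over distinct pairs of labelled nodes, which is one reading of the heuristic in Section 3.4.

```python
import numpy as np

def get_kernel(K, level, obs_idx):
    """Apply the Gaussian kernel update `level` times, starting from K."""
    iu = np.triu_indices(len(obs_idx), k=1)          # distinct labelled pairs (assumed)
    for _ in range(level):
        diag = np.diag(K)
        sq = np.maximum(diag[:, None] - 2.0 * K + diag[None, :], 0.0)
        h = np.sqrt(sq[np.ix_(obs_idx, obs_idx)])[iu].mean()   # assumed bandwidth rule
        K = np.exp(-sq / (2.0 * h ** 2)) / (np.sqrt(2.0 * np.pi) * h)
    return K

def deep_kernel_machine(K, obs_idx, y_obs, miss_idx, level):
    """Level-`level` DKM scores with the simple base kernel machine (3)."""
    y_obs = np.asarray(y_obs)
    K_mo = get_kernel(K, level, obs_idx)[np.ix_(miss_idx, obs_idx)]
    return K_mo[:, y_obs == 1].mean(axis=1) - K_mo[:, y_obs == 2].mean(axis=1)

# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2 - 3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A
s, U = np.linalg.eigh(L)
K_g = (U * np.exp(-0.1 * s)) @ U.T                   # diffusion kernel, beta = 0.1

# Node 0 is labelled class 1 and node 5 is labelled class 2.
scores = {lev: deep_kernel_machine(K_g, [0, 5], [1, 2], [1, 2, 3, 4], lev)
          for lev in (0, 1, 2)}
```

At every level tried here, node 1 (in the same triangle as node 0) receives a positive score and node 4 (in the same triangle as node 5) receives a negative one. The loop is an iterative form of the recursion; a different base kernel machine could be substituted in `deep_kernel_machine` without changing `get_kernel`.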

3.4 A heuristic for choosing $h(\mathcal{F})$

To carry out density estimation in $\mathcal{F}$, a bandwidth parameter $h(\mathcal{F})$ must be specified; see
(6) and (7). While users are certainly free to optimize this parameter in practice, this can be
tedious for DKMs because, as we go from $\mathcal{G}$ to $\mathcal{F}, \mathcal{F}^2, \mathcal{F}^3, \ldots$, there is a bandwidth parameter
for each space, $h(\mathcal{F}), h(\mathcal{F}^2), h(\mathcal{F}^3), \ldots$, so a heuristic is desired. A reasonable heuristic is:

$$h(\mathcal{F}) = \operatorname*{ave}_{v_i, v_j \in V_{obs}(\mathcal{G})} d_{\mathcal{F}}(v_i, v_j). \tag{11}$$

That is, $h(\mathcal{F})$ can be chosen to be the average pair-wise distance in the space of $\mathcal{F}$. We use
this heuristic in all of our experiments below.
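A minimal sketch of this heuristic is given below. The exact normalization is an assumption on our part: the sketch averages over distinct pairs of labelled nodes ($i < j$), and the function name `average_pairwise_distance` is hypothetical.

```python
import numpy as np

def average_pairwise_distance(K, obs_idx):
    """Average feature-space distance over distinct pairs of labelled nodes."""
    K_oo = K[np.ix_(obs_idx, obs_idx)]
    diag = np.diag(K_oo)
    d = np.sqrt(np.maximum(diag[:, None] - 2.0 * K_oo + diag[None, :], 0.0))
    iu = np.triu_indices(len(obs_idx), k=1)     # pairs with i < j (an assumption)
    return d[iu].mean()

# Explicit features with a linear kernel, so distances are easy to verify by hand.
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])
h = average_pairwise_distance(X @ X.T, obs_idx=[0, 1, 2])
```

The three pairwise distances here are 5, 10, and 5, so the bandwidth is 20/3. Averaging over all ordered pairs, including each node with itself, would give a smaller value; either choice only rescales the bandwidth by a constant factor that depends on the number of labelled nodes.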

Table 1: Pseudo code for deep kernel machines (DKMs).

function BaseKernelMachine($K$, $V_{obs}(\mathcal{G})$, $V_{miss}(\mathcal{G})$)
    for (every $v \in V_{miss}(\mathcal{G})$) {
        $f(v) = \alpha_0 + \sum_{v_i \in V_{obs}(\mathcal{G})} \alpha_i K(v, v_i)$,
        e.g., $\alpha_0 = 0$, $\alpha_i = 1/n_1$ if $y_i = 1$ and $\alpha_i = -1/n_2$ if $y_i = 2$.
    }
    return $f$;
end function

function GetKernel($K_{\mathcal{G}}$, level)
    if (level == 0) {
        return $K_{\mathcal{G}}$;
    }
    else {
        compute $d_{\mathcal{F}}$ according to equation (5):
            $d_{\mathcal{F}}(v_i, v_j) = \sqrt{K_{\mathcal{G}}(v_i, v_i) - 2 K_{\mathcal{G}}(v_i, v_j) + K_{\mathcal{G}}(v_j, v_j)}$;
        choose $h(\mathcal{F})$ according to equation (11):
            $h(\mathcal{F}) = \operatorname{ave}_{v_i, v_j \in V_{obs}(\mathcal{G})} d_{\mathcal{F}}(v_i, v_j)$;
        compute $K_{\mathcal{F}}$ according to equations (8) and (7):
            $K_{\mathcal{F}}(v_i, v_j) = K_{h(\mathcal{F})}(d_{\mathcal{F}}(v_i, v_j))$;
        return GetKernel($K_{\mathcal{F}}$, level $- 1$);
    }
end function

function DeepKernelMachine($K_{\mathcal{G}}$, $V_{obs}(\mathcal{G})$, $V_{miss}(\mathcal{G})$, level)
    return BaseKernelMachine(GetKernel($K_{\mathcal{G}}$, level), $V_{obs}(\mathcal{G})$, $V_{miss}(\mathcal{G})$);
end function

4 Experiments

In this section, we describe a few experiments and show that DKMs are useful.

4.1 Enron email data

In 2001, a large USA-based gas and electricity company named Enron was found guilty
of serious accounting frauds, a case that caught worldwide attention. As part of the
investigation, the US Federal Energy Regulatory Commission confiscated its corporate email
database and made it publicly available. Preprocessed versions of these data can be
obtained from http://cis.jhu.edu/~parky/Enron/enron.html. In particular, there is a
184 x 184 adjacency matrix, $A_{\mathcal{G}}$, that indicates whether there was email communication
between any two of 184 unique email accounts. Our initial $K_{\mathcal{G}}$ is simply a diffusion kernel
(2) based on this adjacency matrix, but we removed two accounts that never sent an email
to another account. The status of these 184 email account owners is also available, which
we use to create two different classification tasks (see Table 2).

Table 2: Part of Enron email data used for tasks 1 and 2. We removed two accounts that
never sent an email to another account.

                                  Class Label
Node Status           N      Unbalanced    Balanced
CEO                   5      1             1
President             5      1             1
Managing Director     6      1             1
Director              14     2             1
Vice President        30     2             1
Manager               16     2             1
Lawyer                1      2             1
Employee              40     2             2
Trader                11     2             2
Other                 54     2             2
Total                 182    16 vs 166     77 vs 105
                             (Task 1)      (Task 2)



4.2 Lazega lawyer data 

Lazega (2001) studied collaborative working relationships and social interactions among
members of a New England law firm. There were 36 partners in the firm. The 36 partners
were interviewed and asked to express their opinions on various issues regarding how the
law firm should be managed. One of the issues had to do with workflow inside the firm
(Lazega 2001, Chapter 8). Some favored the status quo ($y_i = 1$) while others favored less
flexible workflow ($y_i = 2$). As our third classification task (see Table 3), we try to predict
the partners' position on this particular issue based on their social interactions and working
relationships, e.g., whether any two partners worked together or considered themselves to be
friends. Our initial $K_{\mathcal{G}}$ is a diffusion kernel (2) based on a similarity (rather than adjacency)
matrix $A_{\mathcal{G}}$, defined as follows:

$$A_{\mathcal{G}}(i,j) = 0.5 \times I_{friends}(i,j) + 0.5 \times I_{collaborated}(i,j),$$

where $I_{friends}(i,j) = 1$ if $v_i$ and $v_j$ were friends and 0 if not; and likewise for $I_{collaborated}(i,j)$.
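To make the construction concrete, the sketch below builds such a similarity matrix from two hypothetical indicator matrices for four partners (these are not the Lazega data) and feeds it into the diffusion kernel (2). Forming the Laplacian from the row sums of the similarity matrix is an assumption made for this illustration.

```python
import numpy as np

# Hypothetical indicator matrices for four partners (not the Lazega data).
friends = np.array([[0, 1, 0, 0],
                    [1, 0, 1, 0],
                    [0, 1, 0, 0],
                    [0, 0, 0, 0]])
collaborated = np.array([[0, 1, 1, 0],
                         [1, 0, 0, 0],
                         [1, 0, 0, 1],
                         [0, 0, 1, 0]])

A = 0.5 * friends + 0.5 * collaborated      # similarity matrix
L = np.diag(A.sum(axis=1)) - A              # Laplacian from weighted "degrees" (assumed)
s, U = np.linalg.eigh(L)
K = (U * np.exp(-0.1 * s)) @ U.T            # diffusion kernel (2) with beta = 0.1
```

Partners 0 and 1 are both friends and collaborators, so their similarity is 1; partners 0 and 2 only collaborated, so their similarity is 0.5; partners 0 and 3 have neither tie, so their similarity is 0.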
Table 3: Part of Lazega lawyer data used for task 3.

Node Position                    N      Class Label
Favors status quo                20     1
Favors less flexible workflow    16     2
Total                            36     20 vs 16
                                        (Task 3)



4.3 Performance measure 

Classification task 1 (Table 2) is a highly unbalanced problem. For this task, we use the
average precision (e.g., Zhu et al. 2006, Appendix A), or simply AP, to evaluate performance.
The AP is a widely used criterion in the information retrieval community, and is particularly
suitable for unbalanced classification problems. Classification tasks 2 and 3 (Tables 2 and 3)
are relatively balanced problems. For these two tasks, we use the area under the
receiver-operating characteristic (ROC) curve (Pepe 2003), or simply AUC (for "area under the
curve"), to evaluate performance. The main reason for using the AUC and the AP (rather
than, e.g., total misclassification error) is that they are not affected by the thresholding
constant $c$ in equation (4). Table 4 summarizes the main features of the three tasks.
Table 4: Summary of classification tasks.

          % ($y_i = 1$)    Data Set    $A_{\mathcal{G}}$            Performance Measure
Task 1    8.8              Enron       adjacency matrix            AP
Task 2    42.3             Enron       adjacency matrix            AUC
Task 3    55.6             Lawyer      custom similarity matrix    AUC

4.4 Base kernel machines 

We use two types of base kernel machines to run DKMs: the simple kernel machine (3) and
the SVM. Both are linear in the feature space, but SVM directly goes after the optimal
hyperplane. Notice that, to fit an SVM, one must specify the amount of penalty on the sum
of slack variables, often called the "cost" parameter in most SVM packages. Every time an
SVM is fitted, we simply choose the best "cost" parameter among a wide range of values:

$10^{-5}$, $10^{-4}$, $10^{-3}$, $10^{-2}$, 0.05, 0.1, 0.5, 1, 2, 5, 10, 50, 100.

This gives the SVM an unfair advantage, but, since we are using SVMs as a benchmark,
it is well understood that giving the benchmark an unfair advantage will only lead us to
more conservative conclusions. In reality, this extra "cost" parameter in the SVM must be
chosen by cross validation on $V_{obs}(\mathcal{G})$.

4.5 Results and conclusions 

Figure 1 shows the average results over 25 random splits of $V(\mathcal{G})$ into $V_{obs}(\mathcal{G})$ and $V_{miss}(\mathcal{G})$.
The random splits are stratified by class label so that the fraction of nodes belonging to
each class is roughly the same on both $V_{obs}(\mathcal{G})$ and $V_{miss}(\mathcal{G})$. The main conclusions we can
draw from Figure 1 are as follows:

(C1) Though it goes directly after an "optimal" linear classifier, the SVM is not necessarily
a better base kernel machine than a simple kernel machine such as (3). Both are linear
in the feature space. It is more important to be in the "right" feature space than to
use an "optimal" linear classifier. Using the "optimal" linear classifier in the "wrong"
feature space is not going to give good results. DKMs address this issue directly
by providing a recursive algorithm to look for the "right" feature space.

(C2) When the initial kernel $K_{\mathcal{G}}$ is badly specified, e.g., if the tuning parameter $\beta$ is not well
chosen for the underlying prediction task, DKMs can often boost performance
significantly. This shows that DKMs have an attractive "automatic kernel correction"
capability. When linear classification in the initial feature space $\mathcal{F}$ is not enough
to produce good results, it often pays to relax linearity and to go up to higher-level
feature spaces. DKMs provide an automatic way to do so.

5 Discussion 

To paraphrase our conclusions (C1) and (C2) above, we have essentially argued that we
should put more emphasis on finding the "right" feature space rather than finding the 
"optimal" linear classifier (perhaps in the "wrong" feature space). One can use DKMs to 
do this. While the automatic and recursive kernel correction formula (10) is attractive, 
there clearly remains one important question that we haven't quite addressed: how deep 
should we go? 
[Figure 1 appears here: six panels, one for each combination of task (Task 1, Task 2, Task 3)
and base kernel machine (simple base kernel machine, SVM as base kernel machine), each
plotting performance against $\beta$ for DKMs of level 0, level 1, level 5, and level 20.]

Figure 1: Average performance over 25 random splits of $V(\mathcal{G})$ into $V_{obs}(\mathcal{G})$ and $V_{miss}(\mathcal{G})$.
Horizontal axis (logarithmic scale): tuning parameter $\beta$ for the initial diffusion kernel $K_{\mathcal{G}}$;
see equation (2). Vertical axis: performance measure, e.g., AP or AUC; see Section 4.3.

Before we address this question, we first briefly mention interesting connections between
our work and some recent literature on deep architectures in machine learning. The neural
network was a leading algorithm for machine learning during the 1980s, but it did not
enjoy as wide a success as was initially anticipated. The main reason is that the
back-propagation algorithm is not practical for training neural networks that are more than a few
layers deep. Recently, many arguments (e.g., Hinton and Salakhutdinov 2006; Sutskever and
Hinton 2008; Bengio 2009) have been made that deep neural networks (i.e., neural networks
with many layers) are necessary, and practically realistic algorithms have also emerged (e.g.,
Hinton et al. 2006; Larochelle et al. 2009). Our work provides further support for the idea
of deep architectures.

By definition, the architecture of a deep neural network is necessarily complex. One has 
to make many decisions. How many layers? How many hidden components for each layer? 
In the landmark article (Hinton and Salakhutdinov 2006) on deep neural networks that 
appeared in the prestigious journal, Science, the authors showcased deep neural nets for a 
number of different tasks. A very striking feature of that article is that the authors used 
vastly different deep architectures for the different tasks, but there was little explanation 
on how those architectural decisions were made. We asked the first author of the Science 
article in person, after he delivered a seminar on the very subject. His answer was: one 
simply tries different architectures and picks the one that gives the best results. While this 
is not entirely satisfactory, we think such a limitation alone is no reason for anyone to deny 
that deep neural networks are a major advance in modern machine learning research. One 
cannot solve all problems at once. New ideas always lead to new problems, and that's the 
very nature of scientific research. 
At this moment, we don't have an entirely satisfactory answer to the question of how
deep a DKM one should use, except that we have empirically observed diminishing returns
as we go to higher and higher levels. This limitation alone, however, is no reason for
us to reject the fact that DKMs can be quite useful.

Finally, it is not hard to see that the development of these DKMs (Section 3) does not
rely on $\mathcal{G}$ being a graph. For example, if we abuse our notation and allow $\mathcal{G}$ to denote the
usual $p$-dimensional Euclidean space, then we simply have a usual classification problem:
$V_{obs}(\mathcal{G})$ simply becomes the set of training data and $V_{miss}(\mathcal{G})$, the set of unlabelled
observations to be classified. Of course, in that case $K_{\mathcal{G}}$ will no longer be the diffusion
kernel, but, regardless of what it is, a level-0 kernel machine using $K_{\mathcal{G}}$ will still be linear in
its implicit feature space $\mathcal{F}$. Using the distance function $d_{\mathcal{F}}$, we can still do kernel density
estimation in the space of $\mathcal{F}$, and obtain a level-1 kernel machine using a new kernel $K_{\mathcal{F}}$.
In other words, the idea of DKMs is general and not restricted to node classification on
graphs. Whether they are actually useful for data structures other than graphs remains to
be seen. We leave this to further investigation.

6 Summary 

We have described the idea of using deep kernel machines for node classification on graphs. 
We have conducted a few experiments to show that linear classification in the implicit 
feature space of kernels commonly used for graph data (e.g., the diffusion kernel) is often 
not enough. When this is the case, one can apply the "kernel trick" again in the implicit 
feature space itself. Repeating this process leads to deep kernel machines (DKMs). Our 
experiments have shown that DKMs' recursive, automatic kernel correction capability is 
especially useful when the initial kernel $K_{\mathcal{G}}$ is not well specified. While the work we reported
here is just a beginning and there remains much to be done, our results lend support to 
the idea of using deep architectures for machine learning that has recently emerged in the 
literature. 